You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Posts with tag Building a Corpus


Apr 03 2011

Shane Landrum (@cliotropic) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don't mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process; at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users or for a mushy middle.

Mar 02 2011


Feb 01 2011

I'm changing several things about my data, so I'm going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Jan 31 2011

Open Library has pretty good metadata. I'm using it to assemble a couple of new corpuses that I hope will allow some better analysis than I can do now, but even the raw data is interesting on its own. (Although, with a single 25 GB text file as the best way to interact with it, it's not always convenient.) While I'm waiting for some indexes to build, it's a good chance to figure out just what's in these digital sources.
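
For the curious, here is a minimal sketch of what a single pass over that dump can look like. It assumes the standard Open Library dump layout (tab-separated lines whose last column is a JSON record) and a hypothetical file name; streaming line by line keeps memory use flat even at 25 GB.

```python
import json
from collections import Counter

# Tally publication dates from an Open Library editions dump.
# Assumes the standard dump layout: tab-separated lines whose fifth
# column is the JSON record. The file name here is hypothetical.
DUMP_PATH = "ol_dump_editions_latest.txt"

date_counts = Counter()
with open(DUMP_PATH, encoding="utf-8") as dump:
    for line in dump:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed lines rather than halting a 25 GB pass
        try:
            record = json.loads(fields[4])
        except json.JSONDecodeError:
            continue
        date = record.get("publish_date")
        if date:
            date_counts[date] += 1

for date, n in date_counts.most_common(20):
    print(f"{date}\t{n}")
```

A streaming pass like this is slow, but it never holds more than one record in memory at a time, which is roughly the trade-off the paragraph above is complaining about.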

Jan 28 2011

I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably somewhat correlated, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

Dec 09 2010

A commenter asked why I don't improve the metadata instead of doing this clustering stuff, which seems only to reproduce, poorly, the work of generations of librarians in classifying books. I'd like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I'm going to think through what I know, but I'd love any advice on this, because it's really outside my expertise.

Dec 04 2010

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher's control. I've noticed that when I ask the computer to do too much of the work of identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical patterns. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.
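
To make that concrete, here is a rough sketch of the kind of over-eager pattern-hunting I mean; the input file and its columns (word, decade, count) are hypothetical. It ranks words by how sharply their frequency jumps from one decade to the next, and in my experience the top of a list like this is dominated by OCR garbage and duplicated volumes rather than real historical change.

```python
import csv
from collections import defaultdict

# Rank words by how sharply their count jumps between adjacent decades.
# The input file and its columns (word, decade, count) are hypothetical.
counts = defaultdict(dict)  # word -> {decade: count}
with open("word_counts_by_decade.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[row["word"]][int(row["decade"])] = int(row["count"])

spikes = []
for word, by_decade in counts.items():
    decades = sorted(by_decade)
    for prev, cur in zip(decades, decades[1:]):
        ratio = (by_decade[cur] + 1) / (by_decade[prev] + 1)  # add-one smoothing
        spikes.append((ratio, word, prev, cur))

# The biggest "discoveries" are usually artifacts: "tlie" for "the",
# long-s misreadings, or one book scanned twice.
for ratio, word, prev, cur in sorted(spikes, reverse=True)[:25]:
    print(f"{word}: {prev}s -> {cur}s, x{ratio:.1f}")
```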

Dec 01 2010

Mostly a note to myself:

Nov 28 2010

Most intensive text analysis is done on heavily maintained sources. I'm using a mess, by contrast, but a much larger one. Partly, I'm doing this tendentiously: I think it's important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.

Nov 14 2010

It's time for another bookkeeping post. Read below if you want to know about changes I'm making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton's supercomputer time, and why I didn't just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Nov 10 2010

Obviously, I like charts. But I've also periodically been presenting data as a number of random samples. It's a technique that can be important for digital humanities analysis. And it's one that draws more on the skills of humanistic training, so it might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own: it's just a set of coordinates. Even in the big education datasets I used to work with, the core facts I was aggregating up from were generally very dull: one university awarded three degrees in criminal science in 1984; one faculty member earned $55,000 a year. But with language, there's real meaning embodied in every point, meaning we're far better equipped to understand than the computer is. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can't read everything ourselves, but it's good to check up periodically: that's why I do things like see what sort of words come in around the 300,000th place in the language, or what 20 random book titles from the sample look like.
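
As an illustration of that spot-checking habit, here is a short sketch of pulling twenty random titles out of a catalog too large to read whole. Reservoir sampling does it in a single pass; the input file (one title per line) is hypothetical.

```python
import random

# Draw k random titles from a large catalog file in one pass
# (reservoir sampling), so the full list never has to fit in memory.
# The input file, one title per line, is hypothetical.
def sample_titles(path, k=20, seed=None):
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            title = line.strip()
            if i < k:
                reservoir.append(title)
            else:
                j = rng.randint(0, i)  # keep each later line with probability k/(i+1)
                if j < k:
                    reservoir[j] = title
    return reservoir

for title in sample_titles("catalog_titles.txt"):
    print(title)
```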

Nov 08 2010

I've rushed straight into applications here without taking much time to look at the data I'm working with. So let me take a minute to describe the set and how I'm trimming it.